load("nmes.rdata")
d3 <- nmes
d <- subset(d3, lastage==65 & eversmk %in% c(0,1))
d$ever <- as.numeric(d$eversmk)
d <- subset(nmes, lastage >= 65 & eversmk %in% c(0, 1))
d$agem65 <- d$lastage- 65
d$age_sp1 <- pmax(0, d$lastage- 75)
d$age_sp2 <- pmax(0, d$lastage- 85)
d$ever <- ifelse(d$eversmk== 1, 1, 0)
model_1 <- lm(totalexp~ agem65 + age_sp1 + age_sp2 + ever +
agem65:ever + age_sp1:ever + age_sp2:ever, data = d)
summary(model_1)
##
## Call:
## lm(formula = totalexp ~ agem65 + age_sp1 + age_sp2 + ever + agem65:ever +
## age_sp1:ever + age_sp2:ever, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9846 -3739 -2838 -882 171074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2445.18 469.31 5.210 1.97e-07 ***
## agem65 161.72 73.38 2.204 0.0276 *
## age_sp1 -102.24 140.94 -0.725 0.4682
## age_sp2 546.81 257.02 2.127 0.0334 *
## ever 1513.54 624.50 2.424 0.0154 *
## agem65:ever -140.64 100.12 -1.405 0.1602
## age_sp1:ever 261.66 206.39 1.268 0.2049
## age_sp2:ever -964.30 463.21 -2.082 0.0374 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10040 on 4720 degrees of freedom
## Multiple R-squared: 0.008323, Adjusted R-squared: 0.006852
## F-statistic: 5.659 on 7 and 4720 DF, p-value: 1.591e-06
d$residuals <- residuals(model_1)
d$residuals_scaled <- d$residuals / 1000
ggplot(d, aes(totalexp)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
labs(title = "Histogram of the Outcome Variable (totalexp)",
subtitle = "Medical expenditure is right-skewed")
ggplot(d, aes(x = lastage, y = residuals_scaled, color = factor(ever))) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
geom_hline(yintercept = min(d$residuals_scaled), color = "green")+
labs(x = "Age", y = "Scaled Residuals (per 1000 dollars)", color = "Ever Smoker") +
scale_color_manual(values = c("0" = "brown", "1" = "purple")) +
facet_wrap(~ factor(ever), ncol = 1) +
labs(title = "Residuals vs. Age for Ever and Never Smokers",
subtitle = "Residuals are not normally distribution") +
theme_minimal() +
coord_cartesian(ylim = c(min(d$residuals_scaled) - 50, max(d$residuals_scaled)))
The minimum residual is -9.845, while the largest residual is > 100, and residuals are skewed to the negative. \(residual=observed−predicted\) The lower bound to the residuals at each year of age is due to the non-negativity of expenditures. If the model predicts a negative expenditure, the residual \(residual=observed + |predicted|\). This may explain those very large residuals. If the model predicts a positive expenditure, the residual \(residual=observed−predicted\) is bounded below by \(−predicted\) and the minimum is achieved when \(observed = 0\).
res_data <- d %>%
group_by(lastage, ever) %>%
summarise(mean_resid = mean(residuals_scaled), .groups = "drop")
ggplot(d, aes(x = lastage, y = residuals_scaled)) +
geom_point(alpha = 0.4) +
geom_hline(yintercept = 0, color = "blue", linewidth = 0.7, linetype = "dashed") +
geom_line(data = res_data,
aes(y = mean_resid),
color = "red",
linewidth = 1) +
facet_wrap(~ever,
labeller = labeller(ever = c("0" = "Never Smokers", "1" = "Ever Smokers"))) +
coord_cartesian(ylim = c(-10, 20)) +
labs(x = "Age", y = "Residual (per $1000)",
title = "Residual vs. age plots") +
theme_minimal()
### c The residuals roughly have mean 0 for age ranges from 65 to 85
(slightly off around age = 70) for both never and ever smokers. For
never smokers, the residual mean rises to ~ 1 when age is around 88,
drops to ~ 0 when age is around 90, and finally rises to ~ 2 when age is
around 95. For ever smokers, the residual mean rises to ~ 1 when age is
around 88, drops to ~ -1 when age is around 90, and finally rises to ~ 4
when age is around 95.
To account for the slight curvature around age = 70, I add a bs(age_sp70, degree = 2) term. I also add bs(age_sp2, degree = 4) and bs(age_sp3, degree = 4) to account for the rise and fall in residuals when age is above 85.
d$age_sp70 <- pmax(0, d$lastage - 70)
d$age_sp3 <- pmax(0, d$lastage - 90)
model_unadjusted <- lm(totalexp ~ agem65 + bs(age_sp70, degree = 2) +
bs(age_sp2, degree = 4) +
bs(age_sp3, degree = 4) + ever +
ever*(agem65 + bs(age_sp70, degree = 2) +
bs(age_sp2, degree = 4) + bs(age_sp3, degree = 4)),
data = d)
d$resid_unadj <- residuals(model_unadjusted) / 1000
res_data_unadj <- d %>%
group_by(lastage, ever) %>%
summarise(mean_resid = mean(resid_unadj), .groups = "drop")
ggplot(d, aes(x = lastage, y = resid_unadj)) +
geom_point(alpha = 0.4) +
geom_hline(yintercept = 0, color = "blue", linewidth = 0.7, linetype = "dashed") +
geom_line(data = res_data_unadj,
aes(y = mean_resid),
color = "red",
linewidth = 1) +
facet_wrap(~ever,
labeller = labeller(ever = c("0" = "Never Smokers", "1" = "Ever Smokers"))) +
coord_cartesian(ylim = c(-10, 20)) +
labs(x = "Age", y = "Residual (per $1000)",
title = "Residual vs. age plots")+
theme_minimal()
pred_grid <- expand.grid(
lastage = 65:94,
ever = c(0, 1)
)
pred_grid$agem65 <- pred_grid$lastage - 65
pred_grid$age_sp70 <- pmax(0, pred_grid$lastage - 70)
pred_grid$age_sp2 <- pmax(0, pred_grid$lastage - 85)
pred_grid$age_sp3 <- pmax(0, pred_grid$lastage - 90)
preds <- predict(model_unadjusted, newdata = pred_grid)
diff <- data.frame(
age = 65:94,
difference = preds[pred_grid$ever == 1] - preds[pred_grid$ever == 0],
difference_scaled = (preds[pred_grid$ever == 1] - preds[pred_grid$ever == 0])/1000
)
ggplot(diff, aes(x = age, y = difference_scaled)) +
geom_line(color = "darkblue", linewidth = 1) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey") +
labs(
x = "Age",
y = "Difference (per 1000 dollars)",
title = "Unadjusted Smoking Effect by Age",
subtitle = "Difference in Mean Expenditures (Ever - Never Smokers)",
) +
theme_minimal()
### b
d <- d %>%
mutate(
sex = as.factor(male),
race = as.factor(RACE3),
beltuse = as.factor(beltuse),
educate = as.factor(educate),
marital = as.factor(marital),
sregion = as.factor(sregion),
povstalb = as.factor(povstalb)
)
model_adjusted <- lm(
totalexp ~ agem65 + bs(age_sp70, degree = 2) +
bs(age_sp2, degree = 4) + bs(age_sp3, degree = 4) + ever +
ever*(agem65 + bs(age_sp70, degree = 2) +
bs(age_sp2, degree = 4) + bs(age_sp3, degree = 4)) +
beltuse + educate + marital + povstalb,
data = d
)
adjusted_diff <- avg_comparisons(
model_adjusted,
variables = "ever",
by = "lastage",
newdata = subset(d, lastage %in% 65:94)
) %>%
as.data.frame() %>%
rename(age = lastage, adjusted = estimate) %>%
mutate(adjusted_difference_scaled = adjusted/1000)
plot_data <- diff %>%
left_join(adjusted_diff %>% select(age, adjusted_difference_scaled), by = "age")
ggplot(plot_data, aes(x = age)) +
geom_line(aes(y = difference_scaled, color = "Unadjusted"), linewidth = 1) +
geom_line(aes(y = adjusted_difference_scaled, color = "Adjusted"), linewidth = 1) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey") +
labs(
x = "Age",
y = "Difference (per 1000 dollars)",
title = "Unadjusted and Adjusted Smoking Effect by Age",
subtitle = "Difference in Mean Expenditures (Ever - Never Smokers)",
) +
scale_color_manual(values = c("Unadjusted" = "darkblue", "Adjusted" = "red")) +
theme_bw()
The plot shows that the estimated differences of expenditures predicted by the adjusted and unadjusted model are changing with age.
The adjusted estimates do not differ from the unadjusted estimates by much as the estimated difference in mean expenditure for ever and never smokers ages from 65 to 94 is almost the same. Only when age is around 87-92 the estimates show some differences. However, this might be caused by the residuals do not have mean of exactly 0 for this range of age, as we can see in the plot of 1c).
So, these variables are either not strongly associated with both smoking status and mean expenditures in this population, or their effects on expenditures are balanced between smokers and non-smokers, and therefore, they are not helpful in predicting the mean expenditures.
d$squared_resid_unadj <- (residuals(model_unadjusted) / 1000)^2
d$squared_resid_adj <- (residuals(model_adjusted) / 1000)^2
plot_unadj <- ggplot(d, aes(x = lastage, y = squared_resid_unadj)) +
geom_point(alpha = 0.4) +
geom_smooth(
method = "loess",
color = "darkred",
linewidth = 1,
se = FALSE
) +
facet_wrap(
~ever,
labeller = labeller(ever = c("0" = "Never Smokers", "1" = "Ever Smokers"))
) +
coord_cartesian(ylim = c(0, 1000)) +
labs(
x = "Age",
y = "Squared Residuals (per $1000²)",
title = "Unadjusted Model: Residual Variance by Age"
) +
theme_minimal()
plot_adj <- ggplot(d, aes(x = lastage, y = squared_resid_adj)) +
geom_point(alpha = 0.4) +
geom_smooth(
method = "loess",
color = "darkblue",
linewidth = 1,
se = FALSE
) +
facet_wrap(
~ever,
labeller = labeller(ever = c("0" = "Never Smokers", "1" = "Ever Smokers"))
) +
coord_cartesian(ylim = c(0, 1000)) +
labs(
x = "Age",
y = "Squared Residuals (per $1000²)",
title = "Adjusted Model: Residual Variance by Age"
) +
theme_minimal()
print(plot_unadj)
## `geom_smooth()` using formula = 'y ~ x'
print(plot_adj)
## `geom_smooth()` using formula = 'y ~ x'
For both unadjusted model and adjusted model, and for both ever smokers and never smokers, the residual variance is not constant with age. Also, it might not be reasonable to assume the residual variance as a function of age is same for ever and never smokers. It is because for never smokers, the residual variance is higher for age 70-75; for ever smokers, the residual variance is higher for age 65-70.
unadjusted_ci <- avg_comparisons(
model_unadjusted,
variables = "ever",
by = "lastage",
newdata = subset(d, lastage %in% 65:94)
) %>%
as.data.frame() %>%
rename(age = lastage, difference = estimate) %>%
mutate(
ci_lower_unadj = difference - qnorm(0.975) * std.error,
ci_upper_unadj = difference + qnorm(0.975) * std.error
)
adjusted_ci <- avg_comparisons(
model_adjusted,
variables = "ever",
by = "lastage",
newdata = subset(d, lastage %in% 65:94)
) %>%
as.data.frame() %>%
rename(age = lastage, difference = estimate) %>%
mutate(
ci_lower_adj = difference - qnorm(0.975) * std.error,
ci_upper_adj = difference + qnorm(0.975) * std.error
)
ci_data <- unadjusted_ci %>%
select(age, difference_unadj = difference, ci_lower_unadj, ci_upper_unadj) %>%
left_join(
adjusted_ci %>% select(age, difference_adj = difference, ci_lower_adj, ci_upper_adj),
by = "age"
)
unadjusted_table <- ci_data[, c("age", "difference_unadj", "ci_lower_unadj", "ci_upper_unadj")]
adjusted_table <- ci_data[, c("age", "difference_adj", "ci_lower_adj", "ci_upper_adj")]
knitr::kable(head(unadjusted_table, 5),
caption = "Adjusted Model Summary Table",
col.names = c("Age", "Difference", "Lower CI", "Upper CI"))
| Age | Difference | Lower CI | Upper CI |
|---|---|---|---|
| 65 | 2565.97214 | 1007.7063 | 4124.238 |
| 66 | 1948.38033 | 759.9255 | 3136.835 |
| 67 | 1330.78853 | 407.7011 | 2253.876 |
| 68 | 713.19672 | -151.2067 | 1577.600 |
| 69 | 95.60491 | -952.1218 | 1143.332 |
knitr::kable(head(adjusted_table, 5),
caption = "Adjusted Model Summary Table",
col.names = c("Age", "Difference", "Lower CI", "Upper CI"))
| Age | Difference | Lower CI | Upper CI |
|---|---|---|---|
| 65 | 2455.23546 | 905.4684 | 4005.003 |
| 66 | 1854.74128 | 672.2307 | 3037.252 |
| 67 | 1254.24710 | 334.9660 | 2173.528 |
| 68 | 653.75292 | -207.6472 | 1515.153 |
| 69 | 53.25874 | -990.3558 | 1096.873 |
ggplot(ci_data, aes(x = age)) +
geom_ribbon(aes(ymin = ci_lower_unadj, ymax = ci_upper_unadj),
fill = "blue", alpha = 0.2) +
geom_line(aes(y = difference_unadj, color = "Unadjusted"), linewidth = 1) +
geom_ribbon(aes(ymin = ci_lower_adj, ymax = ci_upper_adj),
fill = "red", alpha = 0.2) +
geom_line(aes(y = difference_adj, color = "Adjusted"), linewidth = 1) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
labs(
x = "Age",
y = "Difference in Mean Expenditures (Ever - Never Smokers)",
title = "Model-based 95% CIs for Difference",
color = "Model"
) +
scale_color_manual(values = c("Unadjusted" = "darkblue", "Adjusted" = "red"))+
theme_minimal()
compute_difference <- function(data, indices) {
d_boot <- data[indices, ]
model_boot <- lm(
totalexp ~ agem65 + bs(age_sp70, degree = 2) +
bs(age_sp2, degree = 4) + bs(age_sp3, degree = 4) + ever +
ever*(agem65 + bs(age_sp70, degree = 2) +
bs(age_sp2, degree = 4) + bs(age_sp3, degree = 4)),
data = d_boot
)
pred_grid <- expand.grid(
lastage = 65:94,
ever = c(0, 1)
)
pred_grid$agem65 <- pred_grid$lastage - 65
pred_grid$age_sp70 <- pmax(0, pred_grid$lastage - 70)
pred_grid$age_sp2 <- pmax(0, pred_grid$lastage - 85)
pred_grid$age_sp3 <- pmax(0, pred_grid$lastage - 90)
preds <- predict(model_boot, newdata = pred_grid)
diff <- preds[pred_grid$ever == 1] - preds[pred_grid$ever == 0]
return(diff)
}
set.seed(35)
#boot_results <- boot(
# data = d,
# statistic = compute_difference,
# R = 1000)
# saveRDS(boot_results, file = "boot_results.rds")
boot_results <- readRDS(file = "boot_results.rds")
n_ages <- length(65:94)
boot_ci <- sapply(1:n_ages, function(x) boot.ci(boot_results, index = x, type = "perc")$percent[4:5])
ci_bootstrap <- data.frame(
age = 65:94,
ci_lower = boot_ci[1, ],
ci_upper = boot_ci[2, ]
)
boot_data_unadjusted <- merge(diff, ci_bootstrap, by = "age")
ggplot(boot_data_unadjusted, aes(x = age, y = difference)) +
geom_line(color = "darkblue", linewidth = 1) +
geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper),
fill = "blue", alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(
x = "Age",
y = "Difference in Mean Expenditures (Ever - Never Smokers)",
title = "Bootstrap 95% CIs for Difference",
subtitle = "Using Unadjusted Model"
)
compute_difference_adjusted <- function(data, indices) {
d_boot <- data[indices, ]
model_boot <- lm(
totalexp ~ agem65 + bs(age_sp70, degree = 2) +
bs(age_sp2, degree = 4) + bs(age_sp3, degree = 4) + ever +
ever*(agem65 + bs(age_sp70, degree = 2) +
bs(age_sp2, degree = 4) + bs(age_sp3, degree = 4)) +
beltuse + educate + marital + povstalb,
data = d_boot
)
pred_grid <- expand.grid(
lastage = 65:94,
ever = c(0, 1),
beltuse = factor(1, levels = levels(d$beltuse)),
educate = factor(1, levels = levels(d$educate)),
marital = factor(1, levels = levels(d$marital)),
povstalb = factor(1, levels = levels(d$povstalb))
)
pred_grid$agem65 <- pred_grid$lastage - 65
pred_grid$age_sp70 <- pmax(0, pred_grid$lastage - 70)
pred_grid$age_sp2 <- pmax(0, pred_grid$lastage - 85)
pred_grid$age_sp3 <- pmax(0, pred_grid$lastage - 90)
preds <- predict(model_boot, newdata = pred_grid)
diff <- preds[pred_grid$ever == 1] - preds[pred_grid$ever == 0]
return(diff)
}
set.seed(35)
#boot_results_adjusted <- boot(
# data = d,
# statistic = compute_difference_adjusted,
# R = 1000
#)
#saveRDS(boot_results_adjusted, file = "boot_results_adjusted.rds")
boot_results_adjusted <- readRDS(file = "boot_results_adjusted.rds")
boot_ci_adjusted <- sapply(1:n_ages, function(x) {
boot.ci(boot_results_adjusted, index = x, type = "perc")$percent[4:5]
})
ci_bootstrap_adjusted <- data.frame(
age = 65:94,
ci_lower = boot_ci_adjusted[1, ],
ci_upper = boot_ci_adjusted[2, ]
)
boot_data_adjusted <- merge(adjusted_diff, ci_bootstrap_adjusted, by = "age")
ggplot(boot_data_adjusted, aes(x = age, y = adjusted)) +
geom_line(color = "orange", linewidth = 1) +
geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper),
fill = "brown", alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(
x = "Age",
y = "Difference in Mean Expenditures (Ever - Never Smokers)",
title = "Bootstrap 95% CIs for Difference",
subtitle = "Using Adjusted Model"
)
boot_unadjusted_table <- boot_data_unadjusted[, c("age", "difference", "ci_lower", "ci_upper")]
boot_adjusted_table <- boot_data_adjusted[, c("age", "adjusted", "ci_lower", "ci_upper")]
knitr::kable(head(boot_unadjusted_table, 5),
caption = "Adjusted Model Summary Table, Bootstrap",
col.names = c("Age", "Difference", "Lower CI", "Upper CI"))
| Age | Difference | Lower CI | Upper CI |
|---|---|---|---|
| 65 | 2565.97214 | 1216.4896 | 4092.390 |
| 66 | 1948.38033 | 841.4495 | 3103.589 |
| 67 | 1330.78853 | 443.0864 | 2289.472 |
| 68 | 713.19672 | -100.5035 | 1565.988 |
| 69 | 95.60491 | -939.8980 | 1031.951 |
knitr::kable(head(boot_adjusted_table, 5),
caption = "Adjusted Model Summary Table, Bootstrap",
col.names = c("Age", "Difference", "Lower CI", "Upper CI"))
| Age | Difference | Lower CI | Upper CI |
|---|---|---|---|
| 65 | 2455.23546 | 1068.5647 | 3984.6030 |
| 66 | 1854.74128 | 726.0549 | 3029.9580 |
| 67 | 1254.24710 | 382.3363 | 2223.2648 |
| 68 | 653.75292 | -171.8128 | 1491.3510 |
| 69 | 53.25874 | -1013.4762 | 994.8676 |
# unadjusted
hat_unadj <- hatvalues(model_unadjusted)
p_unadj <- length(coef(model_unadjusted))
n_unadj <- nobs(model_unadjusted)
mean_leverage_unadj <- p_unadj / n_unadj
threshold_unadj <- 2 * mean_leverage_unadj
prop_high_leverage_unadj <- mean(hat_unadj > threshold_unadj)
# adjusted
hat_adj <- hatvalues(model_adjusted)
p_adj <- length(coef(model_adjusted))
n_adj <- nobs(model_adjusted)
mean_leverage_adj <- p_adj / n_adj
threshold_adj <- 2 * mean_leverage_adj
prop_high_leverage_adj <- mean(hat_adj > threshold_adj)
cat("Unadjusted Model:\n",
"Proportion of high-leverage points:", round(prop_high_leverage_unadj, 3))
## Unadjusted Model:
## Proportion of high-leverage points: 0.055
cat("\nAdjusted Model:\n",
"Proportion of high-leverage points:", round(prop_high_leverage_adj, 3))
##
## Adjusted Model:
## Proportion of high-leverage points: 0.073
# unadjusted
dfbetas_unadj <- dfbetas(model_unadjusted)
dffits_unadj <- dffits(model_unadjusted)
n_unadj <- nobs(model_unadjusted)
p_unadj <- length(coef(model_unadjusted))
dfbetas_threshold_unadj <- 2 / sqrt(n_unadj)
dffits_threshold_unadj <- 2 * sqrt(p_unadj / n_unadj)
prop_dfbetas_unadj <- mean(apply(abs(dfbetas_unadj), 1, function(x) any(x > dfbetas_threshold_unadj)))
prop_dffits_unadj <- mean(abs(dffits_unadj) > dffits_threshold_unadj)
# adjusted
dfbetas_adj <- dfbetas(model_adjusted)
dffits_adj <- dffits(model_adjusted)
n_adj <- nobs(model_adjusted)
p_adj <- length(coef(model_adjusted))
dfbetas_threshold_adj <- 2 / sqrt(n_adj)
dffits_threshold_adj <- 2 * sqrt(p_adj / n_adj)
prop_dfbetas_adj <- mean(apply(abs(dfbetas_adj), 1, function(x) any(x > dfbetas_threshold_adj)))
prop_dffits_adj <- mean(abs(dffits_adj) > dffits_threshold_adj)
cat("Unadjusted Model:\n",
"Proportion exceeding DFBETAS threshold:", round(prop_dfbetas_unadj, 3), "\n",
"Proportion exceeding DFFITS threshold:", round(prop_dffits_unadj, 3))
## Unadjusted Model:
## Proportion exceeding DFBETAS threshold: 0.087
## Proportion exceeding DFFITS threshold: 0.031
cat("\nAdjusted Model:\n",
"Proportion exceeding DFBETAS threshold:", round(prop_dfbetas_adj, 3), "\n",
"Proportion exceeding DFFITS threshold:", round(prop_dffits_adj, 3))
##
## Adjusted Model:
## Proportion exceeding DFBETAS threshold: 0.107
## Proportion exceeding DFFITS threshold: 0.036
ols_plot_dfbetas(model_unadjusted)
ols_plot_dfbetas(model_adjusted)
ols_plot_dffits(model_unadjusted)
ols_plot_dffits(model_adjusted)
# unadjusted
leverage_threshold_unadj <- 2 * mean(hatvalues(model_unadjusted))
removed_obs_unadj <- which(
abs(dffits_unadj) > dffits_threshold_unadj |
apply(abs(dfbetas_unadj), 1, max) > dfbetas_threshold_unadj |
hatvalues(model_unadjusted) > leverage_threshold_unadj
)
# adjusted
leverage_threshold_adj <- 2 * mean(hatvalues(model_adjusted))
removed_obs_adj <- which(
abs(dffits_adj) > dffits_threshold_adj |
apply(abs(dfbetas_adj), 1, max) > dfbetas_threshold_adj |
hatvalues(model_adjusted) > leverage_threshold_adj
)
length(removed_obs_unadj)
## [1] 483
length(removed_obs_adj)
## [1] 626
d_clean_unadj <- d[-removed_obs_unadj, ]
d_clean_adj <- d[-removed_obs_adj, ]
model_unadjusted_clean <- update(model_unadjusted, data = d_clean_unadj)
model_adjusted_clean <- update(model_adjusted, data = d_clean_adj)
preds_clean <- predict(model_unadjusted_clean, newdata = pred_grid)
diff_clean <- data.frame(
age = 65:94,
difference = preds_clean[pred_grid$ever == 1] - preds_clean[pred_grid$ever == 0],
difference_scaled = (preds_clean[pred_grid$ever == 1] - preds_clean[pred_grid$ever == 0])/1000
)
adjusted_diff_clean <- avg_comparisons(
model_adjusted_clean,
variables = "ever",
by = "lastage",
newdata = subset(d_clean_adj, lastage %in% 65:94)
) %>%
as.data.frame() %>%
rename(age = lastage, adjusted = estimate) %>%
mutate(adjusted_difference_scaled = adjusted/1000)
unadjusted_ci_clean <- avg_comparisons(
model_unadjusted_clean,
variables = "ever",
by = "lastage",
newdata = subset(d_clean_unadj, lastage %in% 65:94)
) %>%
as.data.frame() %>%
rename(age = lastage, difference = estimate) %>%
mutate(
ci_lower_unadj = difference - qnorm(0.975) * std.error,
ci_upper_unadj = difference + qnorm(0.975) * std.error
)
adjusted_ci_clean <- avg_comparisons(
model_adjusted_clean,
variables = "ever",
by = "lastage",
newdata = subset(d_clean_adj, lastage %in% 65:94)
) %>%
as.data.frame() %>%
rename(age = lastage, difference = estimate) %>%
mutate(
ci_lower_adj = difference - qnorm(0.975) * std.error,
ci_upper_adj = difference + qnorm(0.975) * std.error
)
ci_data_clean <- unadjusted_ci_clean %>%
select(age, difference_unadj = difference, ci_lower_unadj, ci_upper_unadj) %>%
left_join(
adjusted_ci_clean %>% select(age, difference_adj = difference, ci_lower_adj, ci_upper_adj),
by = "age"
)
plot_unadj <- ggplot() +
geom_ribbon(data = ci_data,
aes(x = age, ymin = ci_lower_unadj, ymax = ci_upper_unadj),
fill = "lightblue") +
geom_line(data = ci_data,
aes(x = age, y = difference_unadj, color = "Full"),
size = 1.2, linetype = "dashed") +
geom_ribbon(data = ci_data_clean,
aes(x = age, ymin = ci_lower_unadj, ymax = ci_upper_unadj),
fill = "darkblue", alpha = 0.6) +
geom_line(data = ci_data_clean,
aes(x = age, y = difference_unadj, color = "Clean"),
size = 1.2, linetype = "solid") +
geom_hline(yintercept = 0, linetype = "dashed") +
scale_color_manual(name = "Unadjusted Model",
values = c("Full" = "lightblue", "Clean" = "darkblue")) +
labs(x = "Age", y = "Difference in Mean Expenditures ($)",
title = "Unadjusted Model") +
theme(legend.position = "bottom")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot_adj <- ggplot() +
geom_ribbon(data = ci_data,
aes(x = age, ymin = ci_lower_adj, ymax = ci_upper_adj),
fill = "pink") +
geom_line(data = ci_data,
aes(x = age, y = difference_adj, color = "Full"),
size = 1.2, linetype = "dashed") +
geom_ribbon(data = ci_data_clean,
aes(x = age, ymin = ci_lower_adj, ymax = ci_upper_adj),
fill = "red", alpha = 0.6) +
geom_line(data = ci_data_clean,
aes(x = age, y = difference_adj, color = "Clean"),
size = 1.2, linetype = "solid") +
geom_hline(yintercept = 0, linetype = "dashed") +
scale_color_manual(name = "Adjusted Model",
values = c("Full" = "pink", "Clean" = "red")) +
labs(x = "Age", y = "Difference in Mean Expenditures ($)",
title = "Adjusted Model") +
theme(legend.position = "bottom")
plot_unadj
plot_adj
ci_data_display <- ci_data %>%
mutate(
ci_95_unadjusted = paste0("(", round(ci_lower_unadj, 3), ", ", round(ci_upper_unadj, 3), ")"),
ci_95_adjusted = paste0("(", round(ci_lower_adj, 3), ", ", round(ci_upper_adj, 3), ")")
) %>% select("age", "difference_unadj", "ci_95_unadjusted", "difference_adj", "ci_95_adjusted")
ci_data_clean_display <- ci_data_clean %>%
mutate(
ci_95_unadjusted = paste0("(", round(ci_lower_unadj, 3), ", ", round(ci_upper_unadj, 3), ")"),
ci_95_adjusted = paste0("(", round(ci_lower_adj, 3), ", ", round(ci_upper_adj, 3), ")")
) %>% select("age", "difference_unadj", "ci_95_unadjusted", "difference_adj", "ci_95_adjusted")
knitr::kable(head(ci_data_display, 5),
caption = "Comparison of Model Fits, Data Before Clean",
align = "c",
col.names = c("Age",
"Unadjusted",
"95% CI",
"Adjusted",
"95% CI"))
| Age | Unadjusted | 95% CI | Adjusted | 95% CI |
|---|---|---|---|---|
| 65 | 2565.97214 | (1007.706, 4124.238) | 2455.23546 | (905.468, 4005.003) |
| 66 | 1948.38033 | (759.926, 3136.835) | 1854.74128 | (672.231, 3037.252) |
| 67 | 1330.78853 | (407.701, 2253.876) | 1254.24710 | (334.966, 2173.528) |
| 68 | 713.19672 | (-151.207, 1577.6) | 653.75292 | (-207.647, 1515.153) |
| 69 | 95.60491 | (-952.122, 1143.332) | 53.25874 | (-990.356, 1096.873) |
knitr::kable(head(ci_data_clean_display, 5),
caption = "Model Comparison, Data After Clean",
align = "c",
col.names = c("Age",
"Unadjusted",
"95% CI",
"Adjusted",
"95% CI"))
| Age | Unadjusted | 95% CI | Adjusted | 95% CI |
|---|---|---|---|---|
| 65 | 701.5749 | (115.914, 1287.236) | 500.5138 | (24.324, 976.704) |
| 66 | 629.7170 | (183.703, 1075.731) | 420.7551 | (57.704, 783.806) |
| 67 | 557.8591 | (212.271, 903.447) | 340.9964 | (59.504, 622.489) |
| 68 | 486.0012 | (162.727, 809.275) | 261.2376 | (-1.54, 524.015) |
| 69 | 414.1432 | (21.519, 806.768) | 181.4789 | (-136.716, 499.673) |
The data were filtered based on eversmk (smoking status)
lastage (subject age), with totalexp
(self-reported total medical expenditures for 1987) being the outcome
variable. ever , an indicator variable, was created to
denote smoking status (0 for never smoker; 1 for ever smoker).
A multiple linear regression (MLR) model was fitted to estimate the difference in average medical expenditures between ever and never smokers as a function of age. The model was fitted and the residuals were computed iteratively to make sure the model residuals have mean zero as age ranges from 65 to 94. The model includes B-spline terms with knots at ages 70, 75, 85, and 90, which allows the model to account for potential non-linear relationships between age and total expenditure. These knots are strategically placed to capture changes in the relationship at different life stages. To capture more complex patterns between the predictor variables, the interactions terms between age the smoking status were included to account for potential changes in the rate of medical expenditure for different smoking statuses across different age groups. To assess omitted variable bias, an adjusted model is fitted by adding the variables indicating marital status, seat belt use status, educate status, and poverty status as categorical variables to the unadjusted model.
Assuming constant variance and normality of the estimated coefficients, the difference in mean expenditures between ever and never smokers and the corresponding confidence intervals were subsequently computed and visualized in ribbon plots for both the unadjusted and adjusted models across all ages. To assess the validity of these assumptions, a sensitivity analysis was conducted for both models by comparing the bootstrap confidence intervals of the difference with the model-based confidence intervals of the difference. Specifically, 1000 bootstrap samples were generated, and 95% confidence intervals were calculated using the percentile method for both models. The estimated mean differences between ever and never smokers with the corresponding upper and lower bounds of these confidence intervals were visualized in ribbon plots.
Additionally, high-leverages and influential observations were identified using predefined standard thresholds for both adjusted and unadjusted model, and the cleaned data sets were created by excluding these observations from the original data set. Then, both models were refitted with their corresponding cleaned data sets to do the estimation. Again, the difference in mean expenditures between individuals ever and never smokers and the 95% confidence intervals were estimated for both models across all age groups. Finally, the comparisons between the full and clean models, for both adjusted and unadjusted model, were visualized in plots, accompanied by summary tables for better understanding of changes in the estimates.
In the unadjusted model, the average of difference in mean medical expenditures between ever and never smokers across individuals ages from 65 to 94 is -$427.26, with range from $-9998.31 (95% CI: -$19494.75 to -$501.87) to $3385.97 (95% CI: -$1573.12 to $9413.09). In contrast, in the adjusted model, the average of difference in mean medical expenditures between ever and never smokers across individuals ages from 65 to 94 is -$394.19, with range from $-8646.35 (95% CI: -$18099.39 to -$806.69) to $3919.99 (95% CI: -$2083.93 to $8855.88). By including additional variables such as marital and poverty status, the adjusted model showed a narrower range of differences. The reduced range and smaller standard errors in the adjusted model suggest that accounting for omitted variable bias by including these additional variables decreased the variability of the estimates.
Notably, the differences in mean expenditures between the two models were smaller for younger individuals (those below 85) compared to older individuals. This may be due to the smaller sample sizes and increased uncertainty in the estimates for older age groups, as well as potential survivorship bias.
In the sensitivity analysis, 95% bootstrap confidence intervals were compared with the model-based intervals for both models. The results showed that for younger participants (those below 85), the bootstrap and model-based intervals were relatively consistent. However, for older participants, the bootstrap confidence intervals were significantly wider. For instance, at age 94, the unadjusted model’s 95% confidence interval was from -$19283.6957 to $5213.04, whereas the bootstrap interval extended from $-42391.86 to $30409.16. This pattern suggests that the limited data for older individuals increases the level of uncertainty in estimation. Moreover, this trend points to heteroskedasticity, meaning that the variance is not constant across different ages. Thus, the linear model assuming the constant variance and normally distributed residuals may be recalibrated to better fit the data.
Lastly, high leverage and influence observations were identified using predefined thresholds. For the unadjusted model, 5.5% of the observations were classified as high leverage points, while this figure rose to 7.3% for the adjusted model. Regarding high influence points, 8.7% and 3.1% of the observations exceeded the DFBETAS and DFFITS thresholds in the unadjusted model, compared to 10.7% and 3.6% in the adjusted model. Consequently, 483 and 626 observations were excluded from the unadjusted and adjusted models, respectively, representing approximately 11% and 13% of the total data, with the most of them being the observations with age greater than 85. After removing these observations, the confidence intervals became narrower, and the estimates showed less variability, as shown in the summary table and the comparison plots between the full and cleaned models. The cleaned models, based on the remaining observations, better adhered to the model assumptions and yielded more stable estimates.
The inclusion of seat belt use, education, marital and poverty status variables in the adjusted model slightly reduce the estimated difference and the variation of the estimation in mean expenditures between ever and never smokers compared to the unadjusted model. This suggests that accounting for these variables somehow addresses omitted variable bias. However, other potential confounders not present in the dataset, such as other behavioral factors or the presence of chronic conditions, could significantly influence medical expenditures. Future research should aim to incorporate these additional variables to further refine the model. Classification of the additional variables as potential confouders or the mediators of the outcome variable should be careful.
The comparison between bootstrap and model-based confidence intervals revealed consistency among younger individuals (age<85), where the data are more abundant. It is as expected that the bootstrap confidence intervals are wider for older individuals, given the limited observations in this age group. The nature of healthcare data, characterized by non-normal and non-constant residuals, may further contribute to the increased uncertainty in these intervals.
The sensitivity analysis which excluded high leverage and influential observations, gave particularly noteworthy results. The results showed substantial reductions in the width of the 95% confidence intervals—approximately halved for certain age groups—was more significant than expected. This finding indicates that a relatively small portion of the data disproportionately influenced the uncertainty of the estimates.
These results underscore the importance of comprehensive model checking, including adjustments, bootstrapping, and sensitivity analyses, especially when dealing with data that may be highly skewed or heteroscedastic. Such methods are important to ensure the robustness of the findings in healthcare research.